1 Background

We hope to explore the relative influence of physical traits, environmental conditions and species identity on the growth rate of trees. A gradient boosted model seems like a good candidate for this work since they:

1.1 Extracting Principal Components for Environmental Traits

We first converted the environmental variables to principal components, as they were highly correlated. We visualized the PCA and used the eigenvectors to help identify which environmental condition best explained each PC. There were six: Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
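As a minimal sketch of this step, the block below runs a PCA on a small synthetic data set and reads the eigenvectors (loadings) to see which variable dominates each component. The variable names (soil_N, soil_P, canopy_openness) and all values are hypothetical stand-ins, not the environmental data used in this report.

```r
# Synthetic stand-in for correlated environmental variables (hypothetical values).
set.seed(1)
n <- 100
fertility <- rnorm(n)
env <- data.frame(
  soil_N = fertility + rnorm(n, sd = 0.3),
  soil_P = fertility + rnorm(n, sd = 0.3),
  canopy_openness = rnorm(n)
)

# Centre and scale so variables measured in different units contribute equally.
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Eigenvectors (loadings): large absolute values show which variables drive a PC.
loadings <- pca$rotation
top_variable <- apply(abs(loadings), 2,
                      function(v) rownames(loadings)[which.max(v)])
```

Here top_variable gives, for each PC, the variable with the largest absolute loading, which is one simple way to attach a label such as "Soil Fertility" to a component.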

1.1.1 PC1-PC2

1.1.2 PC3-PC4

1.1.3 PC5-PC6

1.1.4 Correlation on Plant Traits

We want to ensure that the plant traits are not correlated. Past work suggests that they are not easily represented using a PCA, so we will not use this feature reduction method for them.
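One way such a correlation check could look is sketched below. The trait names, the synthetic values, the use of Spearman correlation and the 0.7 cutoff are all assumptions made for illustration; they are not taken from this analysis.

```r
# Synthetic traits (hypothetical); trait_d is deliberately built from trait_a.
set.seed(2)
traits <- data.frame(
  trait_a = rnorm(50),
  trait_b = rnorm(50),
  trait_c = rnorm(50)
)
traits$trait_d <- traits$trait_a + rnorm(50, sd = 0.1)

# Pairwise rank correlations, tolerating missing values.
cor_mat <- cor(traits, method = "spearman", use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the assumed cutoff.
cutoff <- 0.7
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx]
)
```

An empty flagged table would support treating the traits as separate predictors; here it contains the one pair that was constructed to be correlated.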

1.2 About Gradient Boosted Models

A gradient boosted machine/model is a machine learning model that uses decision trees to fit the data.

A decision tree first starts with all of the observations, then, from the variables provided, it tries to figure out which variable split would result in the “purest” groupings of the data. So, in this case, it would try to place rows with higher growth rates in one node, and those with lower growth rates in another node.
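The split search described above can be sketched for a single predictor. For a regression tree, "purest" is taken here to mean the smallest within-node sum of squared errors; the helper best_split() and the toy data are hypothetical and only illustrate the idea.

```r
# Find the split point on one predictor that minimises the within-node
# sum of squared errors (SSE) of the response.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between values
  sse <- sapply(candidates, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(sse)], sse = min(sse))
}

# Toy data (hypothetical): low growth rates at small x, high at large x.
x <- c(1, 2, 3, 4, 10, 11, 12, 13)
y <- c(0.5, 0.6, 0.4, 0.5, 2.0, 2.1, 1.9, 2.0)
res <- best_split(x, y)
```

In this toy case the chosen split falls between x = 4 and x = 10, separating the rows with low growth rates from those with high growth rates. A full tree repeats the same search over every predictor and then recursively within each node.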

GBMs are an ensemble of decision trees, but they are fit sequentially. We call GBMs an ensemble of weak learners because each subsequent tree attempts to correct the errors of the trees before it. Thus, while one tree by itself cannot describe the relationships, all the trees together can. Below is a figure by Bradley Boehmke that illustrates how each subsequent tree improves the fit to the data.

Figure: Boosted regression decision stumps as 0-1024 successive trees are added
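To make the sequential fitting concrete, here is a hand-rolled sketch of boosting with decision stumps under squared-error loss. It is not the gbm or h2o implementation: the helpers fit_stump() and predict_stump(), the sine-wave data and the choice of 300 trees are illustrative assumptions, and there is no subsampling or early stopping.

```r
# A stump is a one-split tree: pick the split that minimises SSE,
# then predict the mean of the response on each side.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(candidates, function(s) {
    sum((y[x <= s] - mean(y[x <= s]))^2) + sum((y[x > s] - mean(y[x > s]))^2)
  })
  s <- candidates[which.min(sse)]
  list(split = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}
predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Synthetic non-linear data (hypothetical).
set.seed(3)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

learn_rate <- 0.05
n_trees <- 300
pred <- rep(mean(y), length(y))      # start from the overall mean
train_mse <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  residual <- y - pred               # errors left by the trees so far
  stump <- fit_stump(x, residual)    # the next weak learner targets those errors
  pred <- pred + learn_rate * predict_stump(stump, x)
  train_mse[m] <- mean((y - pred)^2)
}
```

Each stump on its own is a poor model, but train_mse falls as stumps accumulate, which is the behaviour the figure illustrates. The learning rate controls how much of each correction is kept.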

2 Compare Models

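We compared the fit of three gradient boosted models to determine how environmental gradients and physical traits influence RGR: Model 1 (Tree Age + Plant Traits + Environmental Conditions), Model 2 (Tree Age + Species Identity + Environmental Conditions), and Model 3 (Tree Age + Species Identity + Plant Traits + Environmental Conditions).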

Though we present outputs for all three models below, we show that the best model is Model 1, using caret::resamples. This function lets us iteratively build models on the training data and measure performance in each round. In the end, we have resampled measures of each performance metric: R-squared, RMSE and MAE. Before comparing model performance, however, we first train the models; this helps determine a range of parameters that would best fit the data for caret::resamples.
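The resampling idea can be sketched in base R without caret. The block below assumes 5-fold cross-validation repeated 5 times (one way to arrive at 25 resamples), uses a plain linear model on synthetic data in place of a GBM, and computes the three metrics shown in the summary output (MAE, RMSE, Rsquared). Rsquared is computed as the squared correlation between observed and predicted values, which, as far as we are aware, is caret's default definition.

```r
# Synthetic data (hypothetical): y depends on x1 only.
set.seed(4)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + rnorm(n)

# Performance metrics on held-out observations.
metrics <- function(obs, pred) {
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE = mean(abs(obs - pred)),
    Rsquared = cor(obs, pred)^2)
}

k <- 5        # folds (assumed)
repeats <- 5  # repeats (assumed)
results <- do.call(rbind, lapply(seq_len(repeats), function(r) {
  fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
  t(sapply(seq_len(k), function(i) {
    fit <- lm(y ~ x1 + x2, data = dat[fold != i, ])
    metrics(dat$y[fold == i], predict(fit, newdata = dat[fold == i, ]))
  }))
}))

# Min, quartiles, mean and max of each metric across the 25 resamples.
summary_table <- apply(results, 2, summary)
```

Running the same loop for each candidate model on identical folds gives distributions of each metric that can be compared side by side, which is what the summary, boxplot and dotplot of the resamples object report.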

2.1 Train Models

Below, we show the best parameters for the models, given the data.

Model 1: Tree Age + Plant Traits + Environmental Conditions

## $model_id
## [1] "final_grid_model_148"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 3
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 32
## 
## $nbins_cats
## [1] 256
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "deviance"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3574.007
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.28
## 
## $col_sample_rate
## [1] 0.55
## 
## $col_sample_rate_per_tree
## [1] 0.27
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                
##  [5] "Slope"              "Estem"              "Branching.Distance" "Stem.Wood.Density" 
##  [9] "Leaf.Area"          "LMA"                "LCC"                "LNC"               
## [13] "LPC"                "d15N"               "t.b2"               "Ks"                
## [17] "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"           "julian.date.2011"  
## 
## $y
## [1] "BAI_GR"

2.1.1 Model 2: Tree Age + Species Identity + Environmental Conditions

## $model_id
## [1] "final_grid_model_12"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 11
## 
## $min_rows
## [1] 4
## 
## $nbins
## [1] 256
## 
## $nbins_cats
## [1] 4096
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "deviance"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3594.991
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.82
## 
## $col_sample_rate
## [1] 0.7
## 
## $col_sample_rate_per_tree
## [1] 0.92
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "QuantilesGlobal"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
## [1] "Soil.Fertility"   "Light"            "Temperature"      "pH"               "Slope"           
## [6] "Species"          "Tree.Age"         "julian.date.2011"
## 
## $y
## [1] "BAI_GR"

2.1.2 Model 3: Tree Age + Species Identity + Plant Trait + Environmental Conditions

## $model_id
## [1] "final_grid_model_77"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 6
## 
## $min_rows
## [1] 2
## 
## $nbins
## [1] 32
## 
## $nbins_cats
## [1] 4096
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "deviance"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3547.872
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.2
## 
## $col_sample_rate
## [1] 0.7
## 
## $col_sample_rate_per_tree
## [1] 0.85
## 
## $min_split_improvement
## [1] 1e-08
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $calibration_method
## [1] "PlattScaling"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                
##  [5] "Slope"              "Estem"              "Branching.Distance" "Stem.Wood.Density" 
##  [9] "Leaf.Area"          "LMA"                "LCC"                "LNC"               
## [13] "LPC"                "d15N"               "t.b2"               "Ks"                
## [17] "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Species"            "Tree.Age"          
## [25] "julian.date.2011"  
## 
## $y
## [1] "BAI_GR"

2.2 Model Performance

Here, we present results from model comparisons and show that the best model is Model 1 - Plant Traits + Environmental Conditions.

2.2.1 Summary

## 
## Call:
## summary.resamples(object = ModelPerformanceCompare)
## 
## Models: Model 1 - Plant Traits + Environmental Conditions, Model 2 - Species Identity + Environmental Conditions, Model 3 - Species Identity + Environmental Conditions + Plant Traits 
## Number of resamples: 25 
## 
## MAE 
##                                                                           Min.   1st Qu.    Median
## Model 1 - Plant Traits + Environmental Conditions                    0.6207895 0.6957431 0.7265563
## Model 2 - Species Identity + Environmental Conditions                0.7599306 0.8436598 0.8871447
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.6983746 0.7904407 0.8186646
##                                                                           Mean   3rd Qu.      Max. NA's
## Model 1 - Plant Traits + Environmental Conditions                    0.7408249 0.7904460 0.8598637    0
## Model 2 - Species Identity + Environmental Conditions                0.8960906 0.9436792 1.0789833    0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.8194333 0.8587782 0.9173980    0
## 
## RMSE 
##                                                                           Min.   1st Qu.   Median
## Model 1 - Plant Traits + Environmental Conditions                    0.8024465 0.9154721 1.014256
## Model 2 - Species Identity + Environmental Conditions                1.0508723 1.1442295 1.214355
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.9265215 1.0402613 1.094787
##                                                                          Mean  3rd Qu.     Max. NA's
## Model 1 - Plant Traits + Environmental Conditions                    1.012689 1.061483 1.250678    0
## Model 2 - Species Identity + Environmental Conditions                1.213812 1.258893 1.420221    0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 1.096298 1.126219 1.333776    0
## 
## Rsquared 
##                                                                              Min.    1st Qu.     Median
## Model 1 - Plant Traits + Environmental Conditions                    2.158172e-02 0.09393151 0.13196451
## Model 2 - Species Identity + Environmental Conditions                1.588863e-05 0.00258848 0.01265049
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 3.327330e-04 0.02167480 0.07528740
##                                                                            Mean    3rd Qu.      Max.
## Model 1 - Plant Traits + Environmental Conditions                    0.13189257 0.17315454 0.2605188
## Model 2 - Species Identity + Environmental Conditions                0.02690331 0.02926822 0.1442337
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.07121551 0.10480008 0.1884573
##                                                                      NA's
## Model 1 - Plant Traits + Environmental Conditions                       0
## Model 2 - Species Identity + Environmental Conditions                   0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits    0

2.2.2 Boxplot

2.2.3 Dotplot

2.3 Model Results: Tree Age + Plant Traits + Environmental Conditions

2.3.1 Build Model

Now, we can build the model.

library(dplyr)  # filter(), select(), any_of() and the %>% pipe
library(gbm)    # gbm()

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep,
                        PlantTraitsKeep, "Tree.Age", "BAI_GR", "julian.date.2011"))),
      n.trees = 1000,          # number of trees
      interaction.depth = 3,   # max depth
      shrinkage = 0.05,        # learning rate
      n.minobsinnode = 13,     # minimum observations per terminal node
      bag.fraction = 0.28,     # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)            # 5-fold cross-validation

2.3.2 Relative Importance

First, we look at the importance of variables in the model.

2.3.3 Partial Dependence

Here, we assess the relationship between growth rate and each predictor, with the effects of all other predictors averaged out.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass Per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity Per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta 13C

Tree Age

Julian Date in 2011
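The quantity behind each of the partial dependence plots listed above can be sketched by hand. The helper partial_dependence() is hypothetical, and a linear model on synthetic data stands in for the GBM, but the recipe is the same for any model with a predict method: fix one predictor at a grid value for every row, predict, and average.

```r
# Partial dependence of the prediction on one variable.
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v                      # force every row to the same value
    mean(predict(model, newdata = data))  # average over the other predictors
  })
}

# Synthetic data and stand-in model (hypothetical).
set.seed(5)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 2 * dat$x1 + dat$x2^2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + I(x2^2), data = dat)

grid <- seq(0, 1, by = 0.25)
pd_x1 <- partial_dependence(fit, dat, "x1", grid)
```

Plotting pd_x1 against grid gives the partial dependence curve for x1. For this linear stand-in the curve is a straight line whose slope equals the fitted coefficient; a GBM can produce non-linear curves.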

2.3.4 Performance

2.3.5 Interactions | Table

Let’s explore the interactions in these data.

2.3.6 Interaction | Group | Test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Value by Class
## Kruskal-Wallis chi-squared = 22.488, df = 2, p-value = 1.309e-05
##                                                                                  Comparison         Z
## 1 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -3.422308
## 2             Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.679483
## 3                         Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -1.405564
##        P.unadj        P.adj
## 1 6.209183e-04 1.241837e-03
## 2 2.875992e-06 8.627977e-06
## 3 1.598537e-01 1.598537e-01
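As a quick check on the output above, the sketch below recomputes the Kruskal-Wallis p-value from the reported statistic and applies a Holm correction to the three unadjusted pairwise p-values. The use of Holm is an assumption: the printed P.adj values are consistent with it, but the report does not state the adjustment method.

```r
# p-value of the Kruskal-Wallis statistic under a chi-squared distribution.
p_kw <- pchisq(22.488, df = 2, lower.tail = FALSE)

# Unadjusted pairwise p-values, copied from the table above.
p_unadj <- c(6.209183e-04, 2.875992e-06, 1.598537e-01)

# Holm step-down adjustment for three comparisons (assumed method).
p_holm <- p.adjust(p_unadj, method = "holm")
```

The Holm-adjusted values agree with the P.adj column to the printed precision. The two comparisons involving Environmental Conditions:Environmental Conditions remain below 0.05 after adjustment, while the third does not.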

2.3.7 Interaction | Group | Violin

2.3.8 Interaction | Group | Boxplot

2.3.9 Interaction | Group | Density

2.3.10 Interaction | Group | Top

2.3.11 Interactions | Plots

Now, we plot interactions with values > 0.10.

Leaf Phosphorus Concentration:Conductivity per Branch

Leaf Nitrogen Concentration:Conductivity Per Sapwood Area

Huber Value:Tree Age

Soil Fertility:Stem Wood Density

Delta 15N:Huber Value

Vessel Diameter:Percent Sapwood

Conductivity Per Sapwood Area:Julian Date in 2011

Slope:Vessel Diameter

Light:Delta 13C

Percent Lumen:Percent Sapwood

Soil Fertility:Huber Value

Soil Fertility:Percent Sapwood

Soil Fertility:Leaf Carbon Concentration

Branching Distance:Leaf Nitrogen Concentration

Leaf Nitrogen Concentration:Vessel Diameter

Leaf Area:Percent Sapwood

Stem Wood Density:Delta 15N

Leaf Area:Tree Age

Slope:Tree Age

Thickness to Span Ratio:Conductivity per Branch

pH:Conductivity per Branch

Delta 15N:Conductivity per Branch

Leaf Area:Thickness to Span Ratio

Leaf Carbon Concentration:Percent Sapwood

Stem Wood Density:Leaf Carbon Concentration

Modulus of Elasticity for Stem:Huber Value

Light:Branching Distance

Conductivity per Branch:Julian Date in 2011

Leaf Phosphorus Concentration:Thickness to Span Ratio

Stem Wood Density:Huber Value

Light:Julian Date in 2011

Branching Distance:Percent Lumen

Thickness to Span Ratio:Conductivity Per Sapwood Area

Modulus of Elasticity for Stem:Branching Distance

pH:Tree Age

Light:Tree Age

Leaf Nitrogen Concentration:Percent Lumen

2.3.12 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
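A grouped comparison of this kind could be computed by summing per-variable relative influence within each group. The sketch below assumes a table with columns var and rel.inf, the shape returned by summary() on a gbm object; the five variables shown and their values are hypothetical and are not results from the model.

```r
# Hypothetical relative influence values (percent), NOT model output.
rel_inf <- data.frame(
  var = c("Tree.Age", "LMA", "VD", "Light", "pH"),
  rel.inf = c(10, 30, 25, 20, 15),
  stringsAsFactors = FALSE
)

# Assumed lookup from variable name to group.
group_of <- c(Tree.Age = "Tree Age",
              LMA = "Plant Traits", VD = "Plant Traits",
              Light = "Environmental Conditions", pH = "Environmental Conditions")
rel_inf$group <- unname(group_of[rel_inf$var])

# Total relative influence per group.
group_importance <- tapply(rel_inf$rel.inf, rel_inf$group, sum)
```

Because groups contain different numbers of variables, it may also be worth reporting the mean influence per variable within each group alongside the totals.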

3 Compare to GLM

The model below is a GLM with the main effects and the top 22 interactions from the best model. It will be used as a baseline for comparison with other models Julie found in the literature. We caution that this is an exploratory model and not a true translation of the GBM, as we do not model non-linearity.
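A formula of this form can be assembled programmatically from a vector of main effects and a vector of interaction terms. The sketch below uses only three main effects and two interactions on synthetic data; how TestModelFormula was actually constructed is not shown in this report, so the approach here is an assumption.

```r
# A small subset of terms, for illustration only.
main_effects <- c("LNC", "Ks", "VD")
interactions <- c("LNC:Ks", "LNC:VD")

# Builds: BAI_GR ~ LNC + Ks + VD + LNC:Ks + LNC:VD
model_formula <- reformulate(c(main_effects, interactions), response = "BAI_GR")

# Synthetic data (hypothetical values).
set.seed(6)
toy <- data.frame(LNC = rnorm(60), Ks = rnorm(60), VD = rnorm(60))
toy$BAI_GR <- 0.5 * toy$LNC + 0.3 * toy$LNC * toy$Ks + rnorm(60)

# glm() defaults to the gaussian family with an identity link.
fit <- glm(model_formula, data = toy)
```

Writing interactions with ":" rather than "*" adds only the product term, so main effects are listed once and are not duplicated by the interaction terms.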

3.1 Model Summary

## 
## Call:
## glm(formula = TestModelFormula, data = rgr_msh_na)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9145  -0.5267  -0.1461   0.3479   3.4778  
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                       5.958e+00  3.367e+00   1.770  0.07822 . 
## Soil.Fertility                    7.276e-01  1.400e+00   0.520  0.60372   
## Light                             1.675e+00  1.492e+00   1.123  0.26266   
## Temperature                       7.468e-02  9.302e-02   0.803  0.42293   
## pH                               -1.725e-01  1.536e-01  -1.123  0.26277   
## Slope                            -9.130e-03  1.553e-01  -0.059  0.95318   
## Estem                             1.961e-05  2.406e-05   0.815  0.41592   
## Branching.Distance               -2.833e-02  1.644e-02  -1.724  0.08620 . 
## Stem.Wood.Density                 5.656e-01  9.061e-01   0.624  0.53318   
## Leaf.Area                         6.877e-04  4.916e-03   0.140  0.88888   
## LMA                               1.925e-02  5.804e-03   3.316  0.00107 **
## LCC                              -3.369e-02  3.063e-02  -1.100  0.27266   
## LNC                              -9.213e-03  2.649e-01  -0.035  0.97229   
## LPC                              -1.404e-02  2.927e-02  -0.480  0.63206   
## d15N                             -1.363e-01  1.540e-01  -0.885  0.37729   
## t.b2                             -1.047e+01  1.557e+01  -0.673  0.50171   
## Ks                               -1.394e-02  1.288e-02  -1.082  0.28041   
## Ktwig                             1.488e-04  5.997e-04   0.248  0.80431   
## Huber.Value                       2.936e+00  5.575e+00   0.527  0.59898   
## X.Lum                            -4.015e-01  7.229e+00  -0.056  0.95576   
## VD                                1.626e-04  7.483e-05   2.172  0.03094 * 
## X.Sapwood                        -4.738e-01  1.180e+00  -0.401  0.68857   
## d13C                              1.420e-01  6.873e-02   2.066  0.04003 * 
## Tree.Age                         -4.475e-03  1.027e-02  -0.436  0.66335   
## julian.date.2011                  3.263e-03  8.494e-03   0.384  0.70127   
## LPC:Ktwig                        -8.664e-07  2.904e-05  -0.030  0.97623   
## LNC:Ks                            3.844e-03  2.099e-03   1.831  0.06843 . 
## Huber.Value:Tree.Age             -3.513e-03  1.821e-01  -0.019  0.98463   
## Soil.Fertility:Stem.Wood.Density  8.891e-01  5.171e-01   1.719  0.08698 . 
## d15N:Huber.Value                 -1.041e+00  7.993e-01  -1.302  0.19419   
## VD:X.Sapwood                      6.428e-05  5.786e-05   1.111  0.26787   
## Ks:julian.date.2011               2.216e-06  6.299e-05   0.035  0.97196   
## Slope:VD                          1.377e-05  7.893e-06   1.744  0.08254 . 
## Light:d13C                        5.226e-02  4.916e-02   1.063  0.28901   
## X.Lum:X.Sapwood                  -3.803e+00  8.882e+00  -0.428  0.66894   
## Soil.Fertility:Huber.Value        1.196e+00  1.534e+00   0.779  0.43657   
## Soil.Fertility:X.Sapwood         -3.235e-03  3.966e-01  -0.008  0.99350   
## Soil.Fertility:LCC               -2.948e-02  2.934e-02  -1.005  0.31624   
## Branching.Distance:LNC            1.390e-02  5.980e-03   2.325  0.02102 * 
## LNC:VD                           -6.617e-05  2.835e-05  -2.334  0.02052 * 
## Leaf.Area:X.Sapwood               1.531e-03  7.118e-03   0.215  0.82990   
## Stem.Wood.Density:d15N            3.758e-01  2.250e-01   1.670  0.09630 . 
## Leaf.Area:Tree.Age               -1.297e-04  1.262e-04  -1.028  0.30512   
## Slope:Tree.Age                   -3.017e-03  6.087e-03  -0.496  0.62063   
## t.b2:Ktwig                        1.021e-02  1.334e-02   0.765  0.44510   
## pH:Ktwig                         -4.869e-05  1.024e-04  -0.475  0.63497   
## d15N:Ktwig                       -2.259e-05  4.167e-05  -0.542  0.58827   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8795727)
## 
##     Null deviance: 283.97  on 259  degrees of freedom
## Residual deviance: 187.35  on 213  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 748.64
## 
## Number of Fisher Scoring iterations: 2

3.2 R-Squared

## McFadden's R-squared for model is 0.34025566703557
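For reference, a value very close to the one above can be recovered from the deviances printed in the model summary. The sketch assumes the deviance-ratio form, 1 - residual deviance / null deviance; the exact function used to produce the reported figure is not shown, so this is a consistency check rather than a reproduction.

```r
# Deviances copied from the GLM summary (rounded as printed).
null_deviance <- 283.97
residual_deviance <- 187.35

# Deviance-based pseudo R-squared.
r2_deviance <- 1 - residual_deviance / null_deviance
```

The result agrees with the reported value to about four decimal places, the small difference being due to the rounded inputs. For a gaussian GLM with an identity link this ratio is the ordinary R-squared of the corresponding linear model.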

3.3 Variance Partitioning